Cell Genomics — Latest Matching Preprints

1

Meta-analysis of over 8,000 individuals from Hawai'i and Samoa for genetic associations to cardiometabolic phenotypes

Dinh, B. L.; Wang, X.; Sheng, X.; Wan, P.; Srivastava, A. K.; Naseri, T.; Viali, S.; Wilkens, L.; Le Marchand, L.; Haiman, C. A.; Weeks, D.; Chiang, C. W. K.; Carlson, J. C.

2026-05-12 genetic and genomic medicine 10.64898/2026.05.08.26352761 medRxiv

Top 0.1%

28.1%

Although genome-wide association studies (GWAS) now routinely reveal genetic associations and biological insights in millions of individuals, underrepresentation of global populations, such as those from Polynesia, persists. These exclusions, often driven by logistical challenges and lack of data, prevent systematic identification of population-enriched associations, such as the association of the missense variant at the CREBRF locus with BMI and type 2 diabetes, a variant that is common in Polynesian populations but rare in other global populations. Armed with the recently updated TOPMed imputation panel, which could benefit studies in diverse populations that previously had poorer imputation performance, we performed the first GWAS of Native Hawaiians and the largest to date of Polynesian-ancestry populations (combined N up to 8,461) to identify population-enriched associations for 13 adiposity and cardiometabolic traits available across both cohorts: BMI, fasting glucose, fasting insulin, HDL, height, hip circumference, HOMA-IR, LDL, T2D, total cholesterol, triglycerides, waist circumference, and waist-hip ratio. We found 25 trait-locus associations that met genome-wide significance: 20 previously reported or known associations and 5 associations newly confirmed via meta-analysis. In particular, with improved statistical power, we were able to confirm the suspected association between the missense CREBRF variant and fasting glucose levels. The remaining 4 potentially novel locus-trait associations for BMI, LDL, and waist-hip ratio, however, were not replicated in multi-ethnic datasets from All of Us despite having reasonable power to replicate. 
The lack of Polynesian-enriched findings outside of the CREBRF locus informs the bounds of the effect sizes or frequency of any enriched variants, and suggests that further expansion of cohort sizes from this region of the world and improved imputation references specific to these populations are needed to identify more population-enriched associations.

2

Phenome-derived polygenic scores and social determinants jointly shape context-dependent disease risk

Wang, Y.; Truong, B.; Lu, W.; Fadil, C.; He, Y.; Luo, W.; Koyama, S.; Tsuo, K.; Paruchuri, K.; Yu, Z.; Hull, L. E.; Zheng, Z.; Carey, C. E.; Walters, R. K.; Neale, B. M.; Robinson, E. B.; Kraft, P.; Natarajan, P.; Martin, A. R.

2026-04-18 genetic and genomic medicine 10.64898/2026.04.16.26351039 medRxiv

Top 0.1%

26.7%

Polygenic scores (PGS) are typically derived from single-trait genome-wide association studies (GWAS), yet many complex diseases arise from shared genetic liability distributed across correlated clinical dimensions. Accordingly, disease risk depends not only on how genetic liability is represented but also on the social context in which that liability is expressed. Whether phenome-derived latent factors improve prediction, and how social determinants of health (SDoH) modify the realized utility of PGS, remains unclear. Here we constructed PGS for 35 orthogonal latent phenomic factors derived from 2,772 phenotypes in 361,114 UK Biobank (UKB) participants and evaluated their phenomic specificity, cross-dataset portability and predictive performance relative to conventional disease-specific PGS across the UKB holdout, Mass General Brigham Biobank and the All of Us (AoU) Research Program. Factor-based PGS showed widespread, biologically coherent phenome-wide associations that were reproducible across biobanks and ancestries. Their predictive utility, however, was strongly disease dependent. For asthma, a respiratory factor PGS outperformed an internally derived disease-specific PGS and showed superior cross-ancestry portability, retaining 41.5% of European-ancestry predictive accuracy in African-ancestry individuals, compared with 22.9% for an asthma PGS derived from the largest available multi-ancestry GWAS. By contrast, disease-specific PGS remained superior for coronary artery disease (CAD) and type 2 diabetes (T2D). These findings suggest that phenome-derived aggregation is most beneficial when disease-specific GWAS incompletely capture underlying liability, including settings of biological heterogeneity or imprecise phenotyping. We then evaluated SDoH in AoU as a complementary axis shaping prevalent disease prediction beyond genetic susceptibility. 
Across all three diseases, SDoH contributed substantial and largely independent predictive information beyond the disease-optimal genetic model. SDoH also modified how genetic liability translated into observed disease prevalence: for asthma and CAD, genetic stratification attenuated with increasing social burden, whereas this attenuation was substantially weaker for T2D. As a result, the same genetic percentile corresponded to different standardized predicted prevalences across social strata, reflecting disease-specific shifts in baseline prevalence, genetic gradients and calibration. Together, these findings indicate that disease risk is shaped by both genetic liability and the social context in which that liability is realized. Phenome-derived PGS improve prediction under specific architectural conditions, whereas social context independently modifies the performance, calibration and interpretation of genetic risk across populations.
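
The portability figures quoted in this abstract (41.5% versus 22.9%) can be read as the ratio of predictive accuracy in a target ancestry group to accuracy in the European-ancestry reference group. The following is a minimal sketch under that reading, assuming accuracy is summarized as a single number such as incremental R². The function name and the input values are hypothetical and are not taken from the preprint.

```python
def relative_portability(acc_target: float, acc_reference: float) -> float:
    """Fraction of reference-group accuracy retained in the target group."""
    if acc_reference <= 0:
        raise ValueError("reference accuracy must be positive")
    return acc_target / acc_reference


# Hypothetical accuracies chosen for illustration only (not study results).
retained = relative_portability(acc_target=0.02, acc_reference=0.05)
print(f"{retained:.1%} of reference accuracy retained")
```

A ratio of this kind compares scores only within one accuracy metric; it says nothing about calibration, which the abstract treats separately.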

3

The 16p11.2 microdeletion enhances gene expression variability between human IPSC derived forebrain interneuron progenitor cells in culture.

Yang, Y.; Quintana-Urzainqui, I.; Pratt, T.

2026-05-24 genetic and genomic medicine 10.64898/2026.05.21.26353723 medRxiv

Top 0.1%

22.8%

The 574 kilobase pair 16p11.2 microdeletion raises a person's odds for neurodevelopmental and energy balance conditions, particularly autism and obesity. There is considerable clinical heterogeneity, and how much this reflects genetic versus environmental or stochastic factors is unclear. Forebrain interneurons originate from progenitors residing in the ventricular zone of the foetal ventral telencephalon, and their perturbation is implicated in a number of 16p11.2 phenotypes, prompting investigation of how the 16p11.2 microdeletion impacts their development. We differentiate human induced pluripotent stem cells (IPSCs), isogenic except for heterozygous 16p11.2 microdeletion to minimise confounding effects of genetic background, to ventral telencephalic interneuron progenitor fate in 2D culture and use single cell RNA sequencing to obtain single cell transcriptome populations for comparative bioinformatics. Hundreds of transcripts are differentially expressed and many associate with cell signalling, chromatin, neurodevelopmental conditions including autism, and obesity. Pertinently, we find that transcript level variation is significantly greater in 16p11.2 heterozygous progenitors than in their isogenic wild type counterparts. This holds for sets of genes comprising regulons (gene-sets functionally connected by transcription factor regulation) and for randomly selected gene-sets, indicating that the 16p11.2 locus itself has a genome-wide role in stabilising transcription between cells. Regulons with the greatest increase in variation in 16p11.2 heterozygous progenitors exhibit strong enrichment for cell cycle related genes, resonating with our earlier finding of increased cell cycle variability between 16p11.2 heterozygous organoids, and many are regulated by transcription factors associated with autism and/or obesity, reinforcing the idea that unusual transcriptional variation itself contributes to phenotypes.

4

Improving isoform-level eQTL and integrative genetic analyses of breast cancer risk with long-read RNA transcript assemblies

Head, S. T.; Nemani, A.; Chang, Y.-H.; Harrison, T. A.; Bresnahan, S. T.; Rothstein, J. H.; Sieh, W.; Lindstroem, S.; Bhattacharya, A.

2026-03-23 genomics 10.64898/2026.03.22.713514 medRxiv

Top 0.1%

21.9%

Most eQTL and TWAS analyses quantify expression using aggregate, tissue-agnostic transcript annotations and ignore isoform-level regulation, potentially obscuring or misattributing regulatory mechanisms. Here, we developed a framework leveraging publicly available long-read RNA-seq data to perform tissue-informed inference of genetic regulation and prioritize candidate causal isoforms for breast cancer risk. We quantified gene- and isoform-level expression in breast tumor (TCGA), non-cancerous mammary tissue, and cultured fibroblasts (GTEx) using three transcriptome annotations: standard GENCODE, tissue-specific long-read-derived assemblies, and combined annotations incorporating transcript-isoforms from both. While GENCODE cataloged over 250,000 pan-tissue isoforms, the tissue-specific long-read assemblies captured reduced sets of 74,717 isoforms in tumor, 48,057 in fibroblasts, and 22,941 in healthy breast. We performed eQTL mapping and fine-mapping, followed by colocalization with overall and subtype-specific breast cancer GWAS and isoform-level TWAS. While most eGenes were concordant across annotations, approximately 1/3 of lead cis-eQTLs for shared eGenes differed between long-read assemblies and GENCODE. Further, eIsoform discovery was highly annotation-specific. In healthy breast tissue, the gold standard tissue for building gene expression prediction models for TWAS of breast cancer, 46% of eIsoforms identified by the long-read annotation were unique to that annotation even though 93.7% of them are present in GENCODE. Despite combined annotations expanding the GENCODE catalog by only 0.6-7.6% depending on tissue source, 69% of unique significant isoform-trait associations were specific to a single annotation. 
Long-read-informed annotations uncovered regulatory associations entirely missed by GENCODE, including a candidate regulatory isoform at the MARK1 locus captured only in fibroblasts and a previously unannotated splice variant prioritized as the likely effector transcript at NUP107. These findings demonstrate that transcript annotation is not merely a technical consideration but critically defines the biological hypothesis space for regulatory mechanisms and shapes discovery. Incorporating tissue-resolved isoform annotations from long-read RNA-seq improves the specificity of regulatory inference and enhances identification of candidate causal isoforms at GWAS loci.

5

Multiplex Portuguese Families as a Lens into rare mutations and the Shared Genetic Architecture of Schizophrenia, Mood Disorders, and Autism Spectrum Disorders

Pato, C. N.; Pato, M. T.; Mulle, J.; Hart, R. P.; Pang, Z.; Knowles, J. A.; Singh, T.; Maddhesiya, P.; Carvalho, C.; Merikangas, A.; Medeiros, H.; Bigdeli, T. B.; Kazemi, H.; Drake, J.; Vladimrov, V.; Maher, B.; Bacanu, S.-A.; Neale, B.; Fanous, A.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350177 medRxiv

Top 0.1%

18.4%

In an analysis of 173 multiplex families from the Portuguese Island Collection (PIC) we characterize the shared genetic architecture of serious mental illnesses (SMI) including schizophrenia (SZ), bipolar disorder (BP), major depression (MDD), and autism (ASD). Within this cohort, co-segregation of psychotic and mood disorders occurred in 28% of families, while 7% demonstrated co-segregation of intellectual disability or ASD with SZ and mood disorder phenotypes. Whole-genome sequencing (WGS) was performed on a three-generation PIC family to identify rare, large-effect variants. We identified an extremely rare predicted loss of function (LoF) mutation in the Chromodomain Helicase DNA Binding Protein 2 (CHD2) gene. These results demonstrate that high-density multiplex families in founder populations are a powerful resource for mapping rare, large-effect variants that cross clinical diagnostic boundaries, as the identified CHD2 mutation suggests that the disruption of a single neurodevelopmental gene may lead to diverse SMI phenotypes. By combining population and family-based methodologies, this approach leverages shared genetic backgrounds and environments to provide a unique opportunity for cellular studies to explore the biological mechanisms underlying SMI, offering significant potential to inform future functional research and identify novel therapeutic targets.

6

Transposable element-host genome evolutionary arms race revealed by multi-modal epigenomic profiling in a telomere-to-telomere human genome reference

Nikitin, D.

2026-03-23 genomics 10.64898/2026.03.19.712972 medRxiv

Top 0.1%

18.0%

For a quarter of a century transposable elements have been recognized as a major component of the human genome, comprising 46.1% according to recent estimates, and as key drivers of regulatory innovation as well as participants in an ongoing evolutionary arms race with host defense systems. Using the newly released T2T ENCODE dataset, we quantified the epigenetic impact of 3.7 million transposable elements across evolutionary time by analyzing seven epigenomic modalities in twelve human cell lines, spanning six transposon classes, 44 families, and 1,122 subfamilies. We show that SVA elements exhibit the strongest signatures of the arms race, characterized by progressive escape from H3K9me3-mediated heterochromatinization accompanied by increased acquisition of CTCF binding and enhancer-associated chromatin marks. Among Alu elements, the AluYb8 and AluYb9 subfamilies display age-dependent accumulation of CTCF binding, while seven LTR subfamilies (HERV16-int, MER11C, LTR43-int, HERVE-int, LTR22C, LTR5_Hs, HERVIP10FH-int) demonstrate dynamic evolutionary behavior within active chromatin, H3K9me3 chromatin and CTCF contexts. We further evaluated the relative contribution of distinct epigenomic modalities to the host-transposable element conflict and found that transposon-driven evolution is dominated by evasion of host-imposed heterochromatinization, primarily at H3K9me3 and secondarily at H3K27me3, together with progressive invasion into CTCF-rich regions. In contrast, enhancer, promoter, and H3K36me3 marks appear to play more limited roles. Collectively, these findings deepen our insight into the coevolutionary epigenomic dynamics between the human genome and transposable elements and the associated processes driving regulatory innovation.

7

Benchmarking of local ancestry inference with different assays and parameters

Motegi, T.; Huang, F.; Campbell, J. D.

2026-05-21 genomics 10.64898/2026.05.18.726085 medRxiv

Top 0.1%

17.8%

Local ancestry inference (LAI) enables high-resolution characterization of chromosomal segments inherited from distinct ancestral populations, offering unique insights into genetic architecture in admixed cohorts. While LAI is commonly performed with high-coverage whole-genome sequencing (WGS), its performance with other genotyping assays or at varying sequencing depths has not been thoroughly benchmarked. In this study, we systematically evaluated the accuracy of LAI across SNP microarrays, whole-exome sequencing (WES), and ultra low-pass WGS (ULP-WGS) using diverse validation samples and state-of-the-art imputation pipelines. We show that ULP-WGS, when paired with GLIMPSE2, achieves robust accuracy at 0.25x coverage with a minimum genome window size of 0.5 centimorgans, with mean accuracy minus one standard deviation exceeding 95%. For WES, using "on-target" reads alone yields suboptimal performance, particularly for European and South Asian ancestries, with accuracy less than 79.1% and 70.6%, respectively. However, incorporating "off-target" reads in WES and utilizing GLIMPSE2 substantially improved accuracy to ≥95% with a minimum window size of 0.2 centimorgans. We further evaluated formalin-fixed, paraffin-embedded (FFPE) samples and found that LAI could be performed successfully using WES data with accuracies of ≥95% at a minimum window size of 0.5 centimorgans. In contrast, SNP microarrays did not achieve substantial accuracies at any window size (≤95%). Together, these results demonstrate that LAI is achievable without conventional high-coverage WGS and establish optimal parameters for LAI across platforms.
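
The acceptance criterion stated in this abstract, mean accuracy minus one standard deviation exceeding 95%, can be sketched as a simple check. This is a sketch under stated assumptions: the abstract does not say whether a sample or population standard deviation is used, so the sample standard deviation is assumed here, and the function name and accuracy values are hypothetical.

```python
from statistics import mean, stdev


def passes_threshold(accuracies, threshold=0.95):
    """Return True if mean minus one sample standard deviation exceeds threshold.

    Assumption: sample (n - 1) standard deviation; the preprint abstract does
    not specify which estimator was used.
    """
    return mean(accuracies) - stdev(accuracies) > threshold


# Hypothetical per-sample LAI accuracies, for illustration only.
sample_accuracies = [0.97, 0.98, 0.96, 0.99, 0.975]
print(passes_threshold(sample_accuracies))
```

Subtracting one standard deviation makes the criterion stricter than a threshold on the mean alone, since a configuration with a high mean but wide spread across samples would fail.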

8

Allele-resolved monosome and polysome sequencing identifies functional cis-acting variants affecting mRNA translation efficiency

Alunno, L.; Massignani, I.; Hamadou, M. H.; Mazza, F.; Peroni, D.; Belli, R.; Dassi, E.; Romanel, A.; Inga, A.

2026-05-11 genetics 10.64898/2026.05.07.723424 medRxiv

Top 0.1%

17.4%

To prioritize germline genetic variants affecting mRNA fate at the post-transcriptional and translational levels, we leveraged sucrose-gradient-based isolation of 80S monosomes and polysomes, followed by mRNA retrieval and paired-end sequencing. Total cytoplasmic RNA was also sequenced for comparison. Experiments were performed in the non-transformed cell line RPE-1, cultured under basal conditions or upon p53 activation by Nutlin. Differential gene expression analysis confirmed a canonical p53 response. Heterozygous SNPs and SNVs were identified from the RNA-seq data, and allelic fractions (AF) were calculated for total, monosomal, and polysomal mRNAs. Variants showing reproducible AF differences across fractions beyond experimental variability were defined as tranSNPs. Among nearly 7000 heterozygous variants analyzable in polysomal or total RNA and over 5000 in monosomal mRNA, 1155 displayed a significant imbalance. Reporter assays performed in both RPE-1 and HCT116 cells validated allelic or haplotype effects for 17 selected variants in UTRs and coding regions, confirming differences in 15 cases, with evidence of cell line-specific responses. Proteomic analysis further supported allelic imbalance for selected missense variants. Overall, tranSNPs were identified in a non-transformed cell line at frequencies comparable to those in cancer cells, thereby extending their implications in human physiology. Further, monosome profiling enabled improved detection sensitivity of tranSNPs without positional bias, suggesting that 80S profiling improves detection of allele-specific translational regulation in RPE-1 cells. Graphical Abstract: Figure 1.
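
The allelic-fraction comparison described in this abstract can be illustrated with a small sketch. It assumes that AF is the share of reads carrying the alternate allele at a heterozygous site, which the abstract does not state explicitly. The function name, the read counts, and the use of a plain difference are hypothetical, and the sketch omits the replicate-based test for reproducibility that the abstract mentions.

```python
def allelic_fraction(ref_reads: int, alt_reads: int) -> float:
    """Share of reads supporting the alternate allele (assumed AF definition)."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this variant")
    return alt_reads / total


# Hypothetical (ref, alt) read counts for one variant in two RNA fractions.
counts = {"total": (50, 50), "polysomal": (30, 70)}
af = {fraction: allelic_fraction(*reads) for fraction, reads in counts.items()}

# A positive delta means the alternate allele is over-represented on polysomes.
delta = af["polysomal"] - af["total"]
print(af, round(delta, 3))
```

In practice a single difference like this would not be enough to call an imbalance; the abstract requires the difference to be reproducible and to exceed experimental variability.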

9

Autoimmune non-coding variants perturb transcription factor-cofactor complex assembly linked to enhancer activity

Dashtiahangar, M.; Siggers, T.

2026-05-22 genomics 10.64898/2026.05.20.726379 medRxiv

Top 0.1%

14.7%

Most autoimmune disease-associated variants lie in non-coding regions, but the molecular mechanisms linking these variants to gene regulation remain poorly understood. A major unresolved challenge is to determine how disease alleles alter transcription factor (TF) binding, cofactor (COF) recruitment, and enhancer activity at scale. Here, we used the CASCADE method to profile differential binding of five TFs and ten COFs to 2,901 autoimmune disease-associated variants in Jurkat T cells, identifying 516 binding-modulating variants. Variants impacting binding were enriched among MPRA-defined expression-modulating variants and were strongly concordant with allele-specific reporter expression, linking altered TF/COF recruitment to enhancer activity. A majority of variants perturb binding of five major TF families -- ETS, RUNX, SP/KLF, OVOL/MYBL, and bHLH -- all of which have established roles in T cell biology. Notably, we find that ETS and RUNX factor binding is enriched at different variant functional classes, suggesting that they act through distinct regulatory mechanisms at disease loci. We describe allele-dependent regulator "switching" at several loci, where distinct complexes are found at reference and variant alleles, and we identify a recurrent regulatory module involving FOXM1 and the cofactors TIP60, BRD4, NCOA3, and NCOA1 assembling on ETS sites that tracks with gene expression. Together, this integrated biochemical and functional framework prioritizes autoimmune disease-associated variants by linking allele-specific TF/COF binding mechanisms to enhancer activity.

10

The MICA rs2596542 locus is not an island: component-aware haplotype decomposition reveals a composite MHC tag with distinct regulatory axes

Ichikawa, Y.

2026-05-18 genomics 10.64898/2026.05.15.725353 medRxiv

Top 0.1%

14.4%

Cross-population reversal of signed linkage disequilibrium (LD), or the "flip-flop" phenomenon, can arise when a tag SNP captures different extended haplotype backgrounds across populations. The MICA hepatocellular carcinoma susceptibility variant rs2596542 exemplifies this problem in the MHC, where signed LD reverses between Japanese and European populations but the relevant regulatory backgrounds are obscured by haplotypic complexity. We analyzed 7,303 biallelic SNVs surrounding rs2596542 across 26 populations using carrier-set topology classification followed by non-negative matrix factorization of carrier haplotypes. This identified two regulatory axes. Axis I, represented by components c4/c6, was population-stable and MICA-regulatory, with coherent MICA cis-eQTL enrichment and depletion for signed-LD reversal. Axis II, represented by component c5, was enriched for signed-LD reversal and showed an HLA-B↑/HLA-C↓ expression signature with no MICA overlap across six GTEx tissues. In an independent Japanese HCC cohort (LIRI-JP, n = 122), Axis II-associated HLA-C downregulation remained after adjustment for clinical covariates, immune infiltration, and HLA-A expression. The previously proposed cross-population tag rs2244546 mapped to a population-stable component rather than Axis II. A parallel reanalysis of the COMT Val158Met flip-flop locus reproduced the signed-LD pattern reported by Lin et al. and showed population-specific latent backgrounds among Val carriers. These results show that carrier-set topology combined with NMF can decompose composite marker alleles into functionally interpretable regulatory haplotype subspaces.
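
The signed-LD reversal discussed in this abstract can be illustrated with the standard signed correlation r between two biallelic sites, computed from phased haplotypes. This is only a sketch of that textbook quantity; it is not the carrier-set topology or NMF procedure the preprint describes. The function name and the two toy populations are hypothetical.

```python
from math import sqrt


def signed_r(haplotypes):
    """Signed LD correlation r for two biallelic sites.

    haplotypes: list of (a, b) tuples, alleles coded 0/1 on phased haplotypes.
    r = D / sqrt(pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("at least one site is monomorphic")
    return d / denom


# Toy populations (hypothetical): the tag allele travels with allele 1 at the
# second site in pop1, but mostly with allele 0 in pop2.
pop1 = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] + [(0, 1)]
pop2 = [(1, 0)] * 4 + [(0, 1)] * 4 + [(1, 1)] + [(0, 0)]

r1, r2 = signed_r(pop1), signed_r(pop2)
print(round(r1, 3), round(r2, 3))
```

The two toy populations have identical allele frequencies at both sites, so the sign change comes entirely from which alleles share a haplotype, which is the situation a flip-flop describes.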

11

Functionally informed cis and trans proteome-wide association studies prioritize disease-critical genes

Hou, K.; Pazokitoroudi, A.; Strober, B.; Jiang, X.; Price, A. L.

2026-04-27 genetic and genomic medicine 10.64898/2026.04.24.26351667 medRxiv

Top 0.1%

13.9%

Proteome-wide association studies (PWAS) typically link genetically predicted protein levels to disease using cis-pQTLs, which can be limited by low cis-heritability for disease-critical genes under negative selection and by tagging due to co-regulation among nearby genes. Trans-pQTLs provide complementary information when large sample sizes are available to detect weak polygenic effects, enabling associations between trans-predicted protein levels and disease. We developed PolyPWAS, a functionally informed, summary statistics-based framework for associating both cis- and trans-predicted protein levels to disease. PolyPWAS integrates 96 functional annotations with proteome-wide pleiotropy to improve protein prediction, while correcting for PCs of predicted protein levels to limit tagging effects. We applied PolyPWAS to 2.8K plasma proteins measured in 34K UKB-PPP participants, analyzing GWAS summary statistics for 88 diseases and complex traits (average N=336K). Trans-predicted protein levels explained 21% of disease heritability (vs. 9.6% for cis-predicted protein levels), leveraging a 24% relative improvement in trans-prediction accuracy from functional priors. Trans-PWAS identified more significant protein-disease associations (and more conditionally significant associations) than cis-PWAS. Cis and trans associations showed only modest excess overlap (1.18, 95% CI: 1.11-1.26). Accordingly, combining evidence from cis and trans associations improved disease gene prioritization evaluated using gene sets from rare variant association studies (+11% relative improvement) and PoPS (+7.0% relative improvement) relative to cis-only approaches. PWAS associations to disease replicated across protein level cohorts, with strong UKB-PPP/deCODE concordance after adjusting for cohort-specific prediction accuracy. 
We provide examples where trans-regulatory effects link multiple disease-critical genes, underscoring the importance of integrating cis- and trans-regulatory effects to map protein-mediated disease biology.

12

Multi-ancestral GWAS with the VA Million Veteran Program enables functional interpretation of rheumatoid arthritis alleles

Sakaue, S.; Yang, D.; Zhang, H.; Posner, D.; Rodriguez, Z.; Love, Z.; Cui, J.; Budu-Aggrey, A.; Ho, Y.-L.; Costa, L.; Monach, P.; Huang, S.; Ishigaki, K.; Melley, C.; Tanukonda, V.; Sangar, R.; Maripuri, M.; Sweet, S. M.; Panickan, V.; McDermott, G.; Hanberg, J. S.; Riley, T.; Laufer, V.; Okada, Y.; Scott, I.; Bridges, S. L.; Baker, J.; VA Million Veteran Program; Wilson, P. W.; Gaziano, J. M.; Hong, C.; Verma, A.; Cho, K.; Huffman, J. E.; Cai, T.; Raychaudhuri, S.; Liao, K. P.

2026-04-23 genetic and genomic medicine 10.64898/2026.04.22.26351423 medRxiv

Top 0.1%

12.4%

Rheumatoid arthritis (RA) is a heritable and common autoimmune condition. To date, most genetic associations were derived from individuals with either European or East Asian ancestries. Here, we applied a multimodal automated phenotyping strategy to define RA and performed a genome-wide association study (GWAS) of RA in the Million Veteran Program (MVP), including underrepresented African American (AFR) and Admixed American (AMR) populations. Meta-analyses with previous RA cohorts identified 152 autosomal genome-wide significant loci, of which 31 were novel. Inclusion of multi-ancestry data dramatically improved fine-mapping resolution. Functional characterization of these loci using single-cell transcriptomic and chromatin data suggested new RA genes such as CHD7 and CD247. We identified underappreciated functional roles of fine-grained immune cell states other than T cells, such as B cell and myeloid cell states. We observed that multi-ancestry polygenic risk scores using our data demonstrated better predictive ability, especially for AFR and AMR populations.

13

A pan-cancer regulatory atlas of 6,983 GWAS variants prioritizes recurrent regulatory annotations and candidate programs at cancer risk loci

Dutta, S.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.16.26353369 medRxiv

Top 0.1%

12.4%

Genome-wide association studies have identified thousands of cancer risk variants in non-coding regions, yet their regulatory mechanisms remain largely uncharacterized. Here we present a regulatory annotation atlas of 6,983 genome-wide significant variants across 23 cancer types, scored using multimodal AlphaGenome predictions and integrated with ENCODE-4, Roadmap Epigenomics, and JASPAR 2024 annotations. Most variants (70.5%) fall outside annotated cis-regulatory elements; 27.7% overlap enhancers and 1.4% overlap promoters. Comparison with 6,626 position-matched eQTL control variants suggests that enhancer-classified variants carry 1.86-fold higher predicted effects (P = 1e-94) and promoter variants 7.84-fold (P = 2.5e-19). A composite prioritization score (RegVar-basic, excluding GWAS-derived pleiotropy and TF disruption, AUC = 0.650; RegVar-full, AUC = 0.675) outperforms CADD (0.499) and LINSIGHT (0.558) in this cancer-gene discrimination benchmark. Within-locus ranking across 2,626 GTEx DAP-G eQTL credible sets shows that RegVar identifies the highest-posterior-probability variant in 47.3% of loci (P = 7.0e-13), while CADD performs at chance. Predicted target genes show 67.7% concordance with GTEx eQTL assignments. Permutation-controlled motif analysis highlights NFKB1, STAT1, IRF1, and ARNT as exploratory permutation-enriched candidate transcription factors at cancer risk loci. This atlas provides a resource for interpreting non-coding cancer susceptibility variants. Because AlphaGenome uses expression-related training data, GTEx-based validations should be interpreted as partially orthogonal rather than fully independent.

14

Sex stratified analyses enable new genetic insights into brain imaging phenotypes

Zhang, N.; Wang, S.; Fu, J.; Ji, Y.; Liu, N.; Qian, Q.; Xue, H.; Ding, H.; Liang, M.; Qin, W.; Xu, J.; Yu, C.

2026-04-21 genetics 10.64898/2026.04.21.719541 medRxiv

Top 0.1%

12.4%

Sex differences are commonly observed in neuroimaging phenotypes and in the risk of brain diseases, yet the underlying genetic mechanisms remain poorly understood. We investigated sex differences in the genetic architecture of 805 neuroimaging phenotypes in 22,950 males and 22,950 females matched for sample size and covariates, and systematically compared sex-stratified with sex-combined genetic analyses. We found eight variant-trait associations with significant sex differences, 235 fine-mapped sex-dominant causal associations, 457 sex-dominant colocalizations with sex hormones, and 96 sex-dominant colocalizations with schizophrenia. Compared with sex-combined analysis, sex-stratified analysis identified 47 new genetic associations, 170 new fine-mapped causal associations, 1,019 new colocalizations with sex hormones, and 191 new colocalizations with schizophrenia. Additionally, sex-stratified analysis improved global heritability and genetic-correlation estimates and enhanced polygenic prediction for certain phenotypes. This work highlights the need to routinely perform sex-stratified genetic association analyses to elucidate sex-specific and sex-shared genetic control of neuroimaging phenotypes and related disorders.

15

Transposable Elements Facilitate the De Novo Origin of Antifreeze Protein and the Diversification of Its Gene Family in Snailfishes

Rives, N.; Bajpai, P.; Zhuang, X.

2026-04-29 genetics 10.64898/2026.04.28.721326 medRxiv

Top 0.1%

12.4%

Transposable elements (TEs) are increasingly recognized as important sources of genomic innovation, yet mechanistically resolved examples of how they help generate new functional genes in vertebrates remain rare. Type I antifreeze proteins (AFPI) in fishes are life-saving adaptations shaped by strong freezing selection and provide an exceptional system for studying new gene evolution under extreme environmental pressure. We recently showed that AFPI in flounder, cunner, and sculpin evolved independently through distinct partial de novo routes, converging on a nearly identical alanine-rich antifreeze protein. Here, we elucidate the origin and evolution of AFPI in the last remaining unresolved lineage, snailfishes, using a chromosome-scale genome assembly for Liparis atlanticus together with multi-tissue Iso-Seq, tissue-specific RNA-seq, and comparative genomics across AFPI-bearing and AFPI-lacking snailfishes and teleost outgroups. We show that snailfish AFPI originated within Liparis and rapidly diversified as a young gene family with multiple isoforms and lineage- and population-specific copy-number change. Genome-wide homology searches support a de novo origin of the alanine-rich coding region from noncoding sequence rather than from a pre-existing protein-coding precursor. In contrast, the surrounding regulatory architecture was assembled through sequence recruitment: a hAT-derived fragment contributes promoter- and transcription-start-site-proximal sequence, and a conserved noncoding segment together with a Ty3/Gypsy-derived long terminal repeat (LTR) contributes the 3' regulatory region. TE-rich locus structure also provides plausible mechanisms for subsequent locus expansion and translocation. Together, these results reveal a TE-facilitated, mosaic route to new gene evolution in vertebrates, demonstrating how noncoding DNA, repetitive sequence, and TE-derived regulatory fragments can be assembled into a strongly selected adaptive innovation. 
Author Summary: Where do new genes with brand-new functions come from? We tackled this question using one of evolution's clearest natural experiments: antifreeze proteins, life-saving molecules favored by selection because fish without them freeze in icy seawater. In this study, we show that mobile DNA called transposable elements helped build a new antifreeze gene in stages. Different transposable elements appear to have played different roles: one helped switch on a previously silent stretch of noncoding DNA, others contributed control sequences at the beginning and end of the gene, and repeat-rich DNA around the locus likely promoted gene duplication, movement to a new chromosomal location, and rapid diversification into a gene family. This is an unusually clear vertebrate example of how a new gene can emerge not in a single leap, but through stepwise assembly from different pieces of the genome. More broadly, our work shows that transposable elements do much more than disrupt genomes. Under strong natural selection, they can help turn noncoding DNA into a life-saving adaptation and then help that innovation expand and diversify.

16

Profiling Peripheral Blood with an Optimized, Multiplexed, Single-cell Multiome Approach Supports an Insulin-driven Asthma Subtype

Ding, J.; Kang, H.; Spangenberg, A. L.; Liu, Y.; Martinez, F. D.; Carr, T. F.; Cusanovich, D.

2026-03-30 genomics 10.64898/2026.03.27.714744 medRxiv

Top 0.1%

12.3%

RNA sequencing (RNA-seq) and the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) have become standard techniques for studying gene regulation in human populations. Single-cell (sc) "multiomic" genomic methodologies now enable researchers to dissect cellular heterogeneity while simultaneously measuring gene expression and chromatin accessibility within individual cells. However, single-cell approaches remain experimentally complex and cost-prohibitive, limiting their application in population studies and motivating the development of new strategies for population-scale single-cell investigations. To this end, we have adapted and optimized a previous multiomic protocol, "Transcriptome, Epitope, and ATAC sequencing" (TEA-seq), through experimentation and simulation to incorporate sample multiplexing, resulting in our "multiplexed TEA-seq" (mTEA-seq) protocol. Using mTEA-seq, we sought to determine whether asthma that develops in conjunction with early-life elevated insulin levels might have an identifiable molecular signature. We studied samples from adult individuals (54 subjects, 272,003 cells) from the Tucson Children's Respiratory Study (TCRS), a birth cohort phenotypically characterized over four decades, to identify unique molecular characteristics of blood cells from asthmatics who had high serum insulin levels at age 6. Using a Bayesian approach, we found striking sex-specific effects. Male asthmatic subjects with high insulin at age 6 displayed widespread immune transcriptional and epigenetic alterations into adulthood compared to male non-asthmatic subjects without elevated insulin at age 6. We also found that male non-asthmatics with early-life high insulin showed epigenetic perturbations in adulthood, but not transcriptional changes. The consistency of epigenetic signals between these two groups that had high insulin at age 6 was highly cell-type-specific. For example, CD14+ monocytes displayed broadly common insulin-associated chromatin remodeling regardless of asthma status, while NK cells exhibited unique patterns of insulin-associated epigenetic reprogramming depending on asthma status. Finally, genotyping performed directly from our single-cell data enabled cell-type-specific cis-QTL mapping that suggested HLA-DQB1 and AHI as genes for future study in insulin-associated asthma. Our investigation of childhood insulin-associated asthma demonstrates metabolically driven alterations in immune cells that persist into adulthood, thus providing a molecular signature of this asthma subtype and offering novel insights for disease prevention and therapeutic intervention.

17

PanTEon: a cross-kingdom framework to guide the design of transposable element classifiers

Orozco-Arias, S.; Ferrer-Pomer, I.; Rodrigues de Goes, F.; Gaviria-Orrego, S.; Gomiz-Fernandez, J.; Llatser-Torres, J.; Paschoal, A. R.; Guyot, R.; Gabaldon, T.

2026-04-04 bioinformatics 10.64898/2026.04.01.715927 medRxiv

Top 0.1%

12.2%

Transposable elements (TEs) are major drivers of genome evolution, yet their annotation and classification remain inconsistent and hard to reproduce across species. Fragmented repeats, lineage-specific innovations, and heterogeneous taxonomies across databases and tools complicate comparisons and slow progress in TE biology. To address this, we developed PanTEon, a cross-kingdom deep learning framework for reproducible TE classification that combines a harmonized database with an open, modular benchmarking platform. The PanTEon Database is an automatically curated, taxonomically broad TE repository spanning animals, plants, and fungi. The PanTEon platform standardizes training, evaluation, and inference across nine Machine Learning methods, while remaining extensible to user-defined architectures. Using this framework, we benchmark state-of-the-art Machine Learning-based TE classifiers across TE superfamilies and major eukaryotic lineages and find that performance varies markedly by kingdom and superfamily. Ensemble approaches and phylum-specific models improve predictive F1 scores, but cross-species generalization remains a major challenge. Together, PanTEon Database and PanTEon platform provide a reproducible, scalable, and extensible foundation for TE classification, enabling standardized evaluation of future AI methods and supporting community-driven annotation efforts.

18

Sex Steroid Hormone Signaling Tunes Metabolic and Neuronal Programs in Human Cortical Development

Berk-Rauch, H. E.; Gherghina, L.-Y.; Huang, L.; Brand, A. H.; Chakravarti, A.

2026-05-19 genomics 10.64898/2026.05.16.725519 medRxiv

Top 0.1%

12.1%

Autism spectrum disorder (ASD) exhibits a profound male-biased sex ratio. While numerous genes have been implicated in ASD, the functional basis of this sex difference is unclear. One enticing hypothesis is genome-wide transcriptional regulation through estrogens and androgens. While hormone-mediated transcription is well studied in reproductive tissues, its role in cortical development is poorly defined. Thus, we profiled androgen (AR) and estrogen (ESR1/ESR2) receptor expression in mid-gestation human fetal (GW16-24) cortex and complementary cortical organoid models by single-cell RNA-seq. AR was primarily expressed in radial glia and intermediate progenitors, while ESR1/ESR2 were more broadly distributed across multiple cell types of the developing cortex, although with the highest expression in radial glia. To study their genetic effects, we exposed iNeurons and cortical organoids to physiological levels of dihydrotestosterone (DHT) and estradiol (E2). DHT consistently up-regulated oxidative metabolism programs enriched in progenitor cells and down-regulated neuronal maturation pathways, while E2 exhibited a much more attenuated effect. The presence of DHT reduced NTRK2 (TrkB) expression, correlating with expression in fetal cortex, where NTRK2 had significantly higher expression in progenitor cells of the female cortex, which is also reflected in the increased expression of AR in radial glia. Together, these data indicate that in developing human cortical lineages, sex hormones act as selective, cell-state-dependent modulators that tune metabolic and maturation programs rather than broadly reprogramming the genome. Thus, the effects of variation in transcriptional regulation through estrogens and androgens are likely to be minor, but not absent, in ASD.

19

Combinatorial epigenomic patterns define regulatory programs underlying disease heterogeneity

Shim, W. J.; Bao, S. C.; Chow, C. S. Y.; Mizikovsky, D.; Shen, S.; Riedlshah, Z.; Zhao, Q.; Boden, M.; Palpant, N.

2026-05-05 genomics 10.64898/2026.05.01.722123 medRxiv

Top 0.1%

12.0%

Disease is a heterogeneous process that involves multiple organs and cell types. Understanding how genomic variation contributes to disease requires approaches that move beyond the linear assumptions of additive models and resolve underlying disease pathways. While genome-wide association studies have catalogued hundreds of thousands of genomic variants linked to disease, our understanding of their cell-type-specific roles remains limited, restricting our ability to translate genetic findings into targeted interventions. Here, we analyse consortium-scale epigenomic data spanning 833 biological samples across 8 epigenetic features to develop a generalisable machine learning framework that models the modular architecture of genome regulation. We define 720 epigenomic signatures, Epigenetically Co-Modulated Patterns (EpiCops), that capture co-regulated genomic regions with tissue- and cell-specific regulatory activity. Using EpiCops, we effectively segregate functional genomic loci of mixed biological contexts, including cell-type-specific enhancers and variants of complex traits and diseases. Applied to type 2 diabetes, EpiCops identify variant clusters associated with distinct biological pathways and organs, including clusters of opposing cardiovascular risk profiles driven by divergent organ-specific regulatory mechanisms. By integrating EpiCops with partitioned polygenic risk scores, we further validate the robustness of these variant clusters in independent cohort studies. Collectively, our study demonstrates EpiCops as a scalable framework for resolving the cell-type-specific regulatory architecture of complex disease and advancing mechanistic understanding of disease processes.

20

Parental educational attainment polygenic scores contribute to phenotypic heterogeneity in offspring with autism

Gao, S.; Sui, Y.; Tian, P.; Rao, X.; Yan, C.; Xu, Y.; Wang, T.

2026-06-08 genetic and genomic medicine 10.64898/2026.06.03.26354779 medRxiv

Top 0.1%

12.0%

Educational attainment-related polygenic scores have been implicated in autism spectrum disorder (ASD), but how parental polygenic scores shape offspring phenotypes remains unclear. Using genotyping and exome-sequencing data from 142,357 individuals (55,252 ASD cases) in a large ASD cohort, we dissected the direct and indirect genetic effects of educational attainment-related polygenic scores on ASD phenotypes. Trio-model analyses showed that parental polygenic scores for educational attainment (PGSEA) were associated with milder core ASD symptoms, including social deficits and repetitive behaviors, predominantly through indirect genetic effects, whereas their associations with comorbidities were driven predominantly by direct genetic effects. PGSEA was also significantly negatively associated with rare variant burden and prenatal factors, although these factors contributed largely independently to most phenotypes. Adjustment for full-scale intelligence quotient (FSIQ) and socioeconomic status (SES) partially attenuated the indirect effects of PGSEA on offspring phenotypes. Finally, higher parental PGSEA was associated with later age at diagnosis in offspring, partly through its protective effects on ASD phenotypes. These findings indicate that indirect genetic effects of parental PGSEA contribute substantially to phenotypic variation in ASD and highlight family-mediated pathways as an important component of ASD heterogeneity.